IMS Collections 

© Institute of Mathematical Statistics, 
arXiv: arXiv : 1206 . 6913vl 



Sampling From A Manifold * 

Persi Diaconis^ , Susan Holmes^ and Mehrdad Shahshahani^ 

Stanford University and U. Teheran 

Abstract: We develop algorithms for sampling from a probability distribution on a submanifold 
embedded in R n . Applications are given to the evaluation of algorithms in 'Topological Statistics'; 
to goodness of fit tests in exponential families and to Neyman's smooth test. This article is partially 
expository, giving an introduction to the tools of geometric measure theory. 

1. Introduction 

A variety of inferential tasks require drawing samples from a probability distribution on a manifold. 
This occurs in sampling from the posterior distribution on constrained parameter spaces (eg covariance 
matrices), in testing goodness of fit for exponential families conditional on sufficient statistics (eg the 
sum and product of the observations in a Gamma family), and in generating data to test algorithms in 
tolopogical statistics. 

In our applications, we found that examples involved domains with corners and non smooth functions 
(eg max(|a;i|, \x2\, ■ • ■ , |^n|))- We found a useful set of tools in geometric measure theory. One of our goals 
is to explain and illustrate this in the usual language of probability and statistics. 

To introduce the subject, consider the following two examples, used as illustrations throughout. 

Example 1A: The Curved Torus Figure 2 shows a picture of 1000 points on the torus 
(1.1) M = {[(R+rcos(6))cos(ip), (R+ r cos(0)) sm(tp), r sin(0)]}, 

< 0, ip < 2tt for R > r > 0. The torus is formed by taking a circle of radius r in the (x, z) plane, 
centered at x — r, z = and rotating it around the z axis. 

Formula (1.1) gives the embedding of A4 as a compact 2-dimensional manifold in M 3 . As such, A4 
inherits a natural area measure: roughly, take a region on Ai, thicken it out by e to be fully 3-dimensional, 
compute the usual volume of the thickened region and take the limit of this area divided by e as e — > 0. 
This area measure can be normalized to be a probability measure H 2 (dx) on M.. The points shown are 
sampled from H 2 (dx). 
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Figure 1: A sample of 1000 points from the naive measure on a torus with R=l,r=0.9 

Note that the sampled points are denser in regions with higher curvature such as the inside of the 
torus. This distribution is from the naive choice: choose 9 and ip uniformly and map onto M. using (1.1). 
Figure (2.3) show both correctly and incorrectly generated points, see next section. 

Such samples, with noise added, are used to calibrate topological algorithms for estimating dimen- 
sion, number of components and homology in the emerging field of topological statistics. Examples such 
as two linked tori on the seven sphere and Klein bottles are shown to come up naturally in image analysis 
(Carlsson, Carlsson and de Silva, 2006). 

Example IB: Testing the Gamma Distribution For fixed n > 3, S, P > 0, let 

n n 

(1.2) M=[{x 1 ,...,x n ); Xi >0, J2 x i = S ' T[xi = P}- 

i=l i=l 

This is a compact (n — 2)-dimensional submanifold in K n . The need for samples from M. comes up in 
testing if random variables X\ , X-i , . . . , X n are independently drawn from the Gamma density 



(1.3) 



x/cr x a— 1 

<7 a Y{a) 



< x < oo, 



with cr, a > unknown parameters. The sufficient statistics for a, a are S = X)"=i P = II"=i^- -"- n 
numerous writings, R. A. Fisher suggested using the conditional distribution of X\, . . . , X n given S, P to 
give exact goodness of fit tests. These ideas are reviewed in section 3 below. The conditional distribution 
has a simple density with respect to T-L n ~ 2 {dx) leading to practical algorithms for random generation and 
testing. The proposed tests are different than the ones in Kallioras, Koutrouvelis and Canavos (2006) or 
Pettitt (1978). Goldman and Whelan (2000) and Yang (2006) explain interesting applications of these 
tests in modern evolutionary analyses of DNA. 



Related Literature 



There has been a steady interest in statistics on manifolds. The development of mean and variance estima- 
tors appears in Pennec (2006) and Bhattacharya and Patrangcnaru (2003). The book by Bhattacharya 
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and Bhattacharya (2012) about data on the shape space manifold contains several interesting results. 
Data on the sphere and the projective space are discussed in Beran (1979), Fisher, Lewis and Embleton 
(1993) and Watson (1983). Data on more general manifolds appear in Ginc (1975). One widespread ex- 
ample occurs in physics and chemistry problems involving configurations of atoms with some inter-atomic 
distances or angles fixed; see Fixman (1974) or Ciccotti and Ryckaert (1986). Any of these settings give 
rise to the need for Monte Carlo sampling on manifolds. 

There are well-known algorithms for sampling from the uniform distribution on compact groups and 
other homogeneous spaces. For instance, Eaton (1983) proves that if an n x n matrix is filled with iid 
standard normals and the QR decomposition is carried out, then the Q part is distributed as the uniform 
distribution on the orthogonal group (Haar measure). Mczzadri (2007); Diaconis and Shahshahani (1986) 
develop this. There are also elegant algorithms for sampling from the boundary of compact, convex sets 
in M. n (Bclisle, Romcijn and Smith, 1993; Bocnder et ai, 1991). A different procedure, the Lalley and 
Robbins (1987) "princess-and monster" algorithm has been studied for sampling from the boundaries 
of more general sets (Comets et ai, 2009; Narayanan and Niyogi, 2008). These algorithms are based on 
moving within the interior of the bounded set reflecting off the boundary. They are different from the 
present procedures and may be very effective when applicable. We do not know previous literature on 
sampling from more general manifolds. 

Of course, conditional probability densities are standard fare, even with very general conditioning. 
However, explicit description of area measure and the use of the co-area formula is not so common. We 
only know of the foundational monograph by Tjur (1974). This contains a good history. The development 
is both more and less general. Tjur works with Riemannian manifolds and smooth functions. We work 
with embedded manifolds but allow Lipschitz functions such as max/min. Tjur gives a self-contained 
development based on Radon measures. We are able to use more classical foundations from standard 
sources. Tjur's valuable monograph was written before the computer revolution. We emphasize techniques 
useful for sampling. 

This paper studies the following problem of sampling from M 1 an m-dimensional submanifold in 
R". Consider f(x) > such that J M f(x)ii m (dx) < oo with / H m (dx) the m-dimensional area measure 
on M.. Samples are to be drawn from the normalized version of /. Section 2 gives basic definitions for 
submanifolds, area measure, Jacobians and the co-area formula. These notions are illustrated on examples 



Section 3 develops the theory for exponential families, Section 4 that of Neyman's smooth test. 

The algorithms presented are reasonably standard Markov chain Monte Carlo methods supplemented 
by some geometrical tricks and the tools of geometric measure theory. We hope they will be useful to 
researchers who face similar problems. 

The subject developed here may be considered as a continuous analog of algebraic statistics as ini- 
tiated in Diaconis and Sturmfcls (1998) and reviewed in Drton, Sturmfels and Sullivant (2009). That 
theory began by developing algorithms for sampling from the conditional distribution of discrete expo- 
nential families given their sufficient statistics. There, finding ways of moving around on the space of data 
sets with given sufficient statistics leaned on tools from computational algebra (Grobner bases). Here, 
the same task is studied using direct geometric analysis and tools such as the curve selection lemma. 

2. Definitions and Tools 

The classical subject of calculus on manifolds has an enormous expository literature. We have found the 
elementary treatment of Hubbard and Hubbard (2007) readable and useful. In our applications, pieces of 
manifolds with corners occur naturally. For example, testing the three-parameter Gamma density gives 
rise to 



1A,1B. 



n n 



M = Uxi, ...,x n ); Xj > 0, Xj = S. Y[ %i 



= P, min x 




i=l i=l 
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Geometric measure theory provides convenient tools. We use Federer (1996), denoted [F], as a principle 
reference. The introductory account by Morgan (2009) gives a useful taste of the subject matter. Recent 
references are Mattila (1999), Krantz and Parks (2008). 

2.1. First Definitions 

A function / : IR m — > R" is Lipschitz if \f(x) — f(y)\ < c\x — y\ for some finite, positive c. Euclidean 
distance is used for | • | on both sides. A set in M. n is m-rectifiable [F, p. 251] if it is the Lipschitz image 
of a bounded subset in IR m . This is a very rich class of sets, discussed at length in the references above. 
All of the sets that arise in our applications are rectifiable. 

Use X n (dx) for Lebesgue measure on the Lebesgue measurable sets of M. n . Given any subset A C M. n , 
define the m-dimensional Hausdorff measure T-L m {A) by 

H™(A) = Um inf £«4^^r 

V ' 5^0 ACUS,, ^ \ 2 / 
diam(Si)<(5 

The infimum is taken over all countable coverings Si of A with diam(S'j) = sup{|x — y\ : x,y € Si} and 
a m = r(i) m /r[(^) + 1], the volume of the unit ball in K m . Hausdorff measure is an outer measure which 
is countably additive on the Borel sets of R". It serves as area measure for subsets. If the set A above is 
m-rectifiable, the coverings above can be restricted to balls or cubes [F, Sect. 3.2.26]. For a closed set A, 
[F, Sect. 3.2.39] shows U m (A) = lim^o X n {x : dist(x ! A) < e} / 'a {n „ m) e n - m , thus justifying the heuristic 
definition of area measure in Example A of Section 1. 

To actually compute area measure, the Jacobian is an essential tool. Call / : M." 1 — > R™ differcntiablc 
at x £ M. m if there exists a linear map L : W n — > R™ with 

\im\f(x + h)-.f(x)-L(h)\/\h\=0. 

h— 5-0 

The linear map L is denoted Df(x) when it exists. A celebrated theorem of Rademacher [F, Sect. 3.1.6] 
says that a Lipschitz function is differentiable at A m a.e. x £ M. m . For a differentiable function, Df can 
be computed using partial derivatives Di{x) = lim^_ J .o(/(a ; ij ■ ■ • , x i + h, . . . , x m ) — f{x))/h. As usual, the 
derivative matrix is 

(Df(x))ij = Difiix) 1 < i < m, 1 < j < n 

If / : W n — > K n is differentiable at x, the fc-dimensional Jacobian Jkf(x) may be defined as the norm of the 
derivative matrix [F, page 241]. Geometrically Jkf(x) is defined as the maximum fc-dimensional volume of 
the image under Df(x) of a unit /c-dimensional cube in IR m (the maximum over all possible rotations of the 
cube under orthogonal rotations in O m (Morgan, 2009, p. 25)). As usual, if rank Df(x) < k, Jkf(x) = 0. 
If rank Df(x) = fc, then (Jkf(x)) 2 equals the sum of squares of the determinants of the kxk submatrices 
of Df(x). Usually, k = m or n. Then (Jkf(x)) 2 equals the determinant of the kxk product of Df(x) 
and its transpose. If k = m = n, Jkf(x) is the absolute value of the determinant of Df(x). 

2.2. The Area Formula 

The basic area formula [F, Sect. 3.2.5] is a useful extension of the classical change of variables formula of 
calculus. 

Theorem: Area Formula // / : M m — > K n is Lipschitz and m < n, then 
(2.1) [ g(f(x))J m f(x)X m (dx)= [ g( y )N(f\A, y )H m (dy) 
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whenever A is X m measurable, g : R n — > K is Borel, and N(f\A, y) = H={x G A : f(x) = y}. 



Remarks 

1. In this paper, / is usually a parameterization of a sub-manifold M., so / is 1 — 1 and the right-hand 
integral is the surface area integral of g over f(A). The left side shows how to carry out this integral using 
Lebesgue measure on M. m and the Jacobian. It shows that sampling from the density J m f(x) (normalized) 
on R m and then mapping onto Ai via / gives a sample from the area measure. 

2. There are many extensions and refinements of the area formula [F, Sect. 3.2]. In particular [F, 
Sect. 3.2.20] extends things to approximately differentiable functions and [F, Sect. 3.2.46] extends from 
Euclidean space to Riemannian manifolds. 



Example 1A continued: The Curved Torus For the parameterization given in Example 1A, the 
curved torus is the Lipschitz image of {0 < 9,ip < 2tt}, with 



(2.2) 



/(M) = (i? + rcos(60)cosWO,(i? + rcos(6)))sin(VO, rsin(0)) 



(2.3) 



Df(e,i>) = 



-r sin(0) cos(V') — (i? + r cos(0)) sm(ip) 
-r sin(0) sin(^) (R + r cos(0)) cos(tp) 
rcos(9) 



(2.4) 



J 2 2 /(0» =det 



(i? + rcos(6»)) 5 



r 2 (R + rcos(9)) 2 



As explained in Section 2, A4 is parametrized by U = {8, ip, < 6, ip < 2ir} and the task reduces to 
sampling (9,tp) from the density g(6,ip) = + (r/R)cos9). A random number generator outputs 

points that we assume are uniformly distributed on [0, 1] and the task reduces to converting these into a 
sample from g. From the form of g, the measure factors into the uniform density for tp on [0, 2tt) and the 
density 

g x {9) = + ^ cos ^) On0 < 6 ' <27r - 

We may sample points from g\ by rejection sampling (Hammersley and Handscomb, 1964). The function 
(1 + (r/R)cosO) is enclosed in the box < 6 < 2tt, [1 - (r/R) < rj < 1 + (r/R)]. Choose points (9,n) 
uniformly in this box from two-dimensional Lebesgue measure. This uses two calls to the underlying 
uniform random number generator. If 77 < 1 + (r/R) cos#, output 9. If not, choose again, continuing until 
the condition holds. The resulting 9 is distributed as g\. Sample code for this is in algorithm 1. 



Algorithm 1 Rejection Sampling yielding g\. 

rej ect=f unction (n=100 , r=0 . 5 ,R=1) { 
#Rejection sampler 

xvec=runif (n, , 2*pi) 
yvec=runif (n, , 1/pi) 
fx=(l+(r/R)*cos(xvec))/(2*pi) 
return(xvec [yvec<f x] ) } 



What we get is a density with support [0, 27r]. See Figures 1 and 2 below. 
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Fig 1. Rejection sampling density proportional to 1 + ^cos(8) 



Example IB continued: Sum and Product Fixed Here 

n n 

M = j(ari, . . .,x n ); x t > 0, J^x, = 5, JJa;* = p}. 



i=l 



The constraints S, P satisfy < P 1 /™ < S/n because of the arithmetic-geometric mean inequality. Any 
such S, P can occur. To find a parameterization of M. consider the projection 

II : M -> R n ~ 2 

(xi , . . . , X n ) y (X3 , . . . , X n ) 

Let s = x 3 + -- -+x n = S — t with t > and X3X4 ■ ■ ■ x n = p. The equations X\ + x 2 = t, X\X 2 = P/p 
have a positive real solution if and only if t 2 > AP/p. In this case the solution is the pair 



{xi,x 2 } = [t±^-^j/2. 
One way to parametrize M is to define 

(2.5) M + = {x e M : x x > x 2 }, Mr = {x E M : x x < x 2 } 

Define U = {{x 3 , . . . , x n ) : Xi > 0, s < S,p < 4P/(S* - 4) 2 } a = £™ =3 &i> P = U7=a ^ 
f :li -¥ M + is defined by 

(2.6) f(x 3 ,...,X n ) = (fl(x 3 , . . . ,X n ), f 2 {X3, ■ ■ ■ ,x n ),x 3 , . . . ,x n ) 

(s - EL 3 xi) + J{s-Y,U*i) 2 -TFiz 

with /i(a;3,...,a; n ) = ? 
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var 1 
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**•=* tVis**- 






ft .;.-.* - ••.■>*->-^.-\J 




var 2 












> 'ut;«.wiu> > 




f.-- ;• * ••&:•-•*./•*.. *..\ 




var 3 



-2 -1 1 2 -0.5 0.0 0.5 



Correctly generated points uniformly on the torus 



-2-10 1 2 



var 1 




' ^jj^" **"* 




Sg. • • • t. ■ ■■' , -ifi 








, aS . ; ;;,,**:.. 




var 2 












V' '*;*// 








var 3 



-2 -1 1 2 -0.5 0.0 0.5 



Basic Uniform on Parameters. 



Fig 2. Top figure shows a 3D representation of a sample of size 1000 with parameters R = 1, r = 0.9, the lower figure 
shows the incorrectly sampled points, although the difference is not obvious visually, standard tests pick up the difference 
between the two distributions. imsart-coll ver. 2011/11/15 file: DiaconisHolmesShah.tex date: July 6, 2012 
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and f 2 (x 3 ,...,x n ) 



The derivative is the n x (n — 2) matrix 



(s-Er= 3 *iWOs-E?= 



(2.7) 



£>/ = 



#3/1 Difi 

D 3 f 2 Difc 
1 
1 



\2 _ iP 

,i=3 ■*-!■) nr= 3 »i 



AJl 
I>n/2 






( Jn~2f{x)) 2 = dct((Df) T Df) is the determinant of a matrix of form J„_ 2 + VV T + WW T with F T , W T 
the first and second rows of Df. A well-known determinant identity reduces this to a 2 x 2 determinant; 
if B is p x m and C is m x p then det(/ p + BC) = det(I m + CB). It follows that 



(2.8) 



((J„_ 2 /(a;)r = det U 



To summarize: 

Proposition 1. The density of the (n — 2) -dimensional area measure T-L n ~ 2 on the submanifold M + 
in (2.5) parametrized by f : hi — > A4 + is J n -2f(x) of (2.8), above with V, W the first two rows of matrix 
(2.7). 



Remarks 

1. Up to sets of H n ~ 2 measure 0, a similar result holds for Mr . Since Mr and M + patently have 
the same area measure, it is easy to sample from M using an additional coin flip to randomize the first 
two coordinates. 

2. Of course, any (n — 2)-tuple of coordinates can be used for the parameterization. In practical 
simulation, it might be wise to sample from M as above and follow this by a random permutation of the 
coordinates. 

3. The function / defined in (2.6) , is only locally Lipschitz because of p in the denominator. However, 
U may be decomposed into a countable union of pieces with / Lipschitz on each piece. Because the formula 
for J n f is local, the proposition is true as stated. 

Given S, P, the manifold M + is parametrized by hi of Example IB. The task of sampling from 
area measure on M is reduced to sampling from J n _2/(x) in hi. One problem here is that although 
z = j u J n ^2f( x )^ ri ^ 2 (dx) < oo and J n -2f I ' z is a probability density on hi, the value of z is unknown. 
This standard problem may be approached by the Metropolis algorithm, the Gibbs sampler, importance 
sampling, or by the hit-and-run algorithm in many variations (see Liu (2001), Andersen and Diaconis 
(2008) for background). Here, we briefly explain the Metropolis algorithm for sampling from J n -2f ■ This 
generates a Markov chain Xq, Xi, X2, ■ ■ ■ starting from Xq — xq, a fixed point in hi. From X n = x G hi, 
we propose y £ M. M , choosing y = x + e with e chosen (say) uniformly from a unit cube centered at x. 
Then, 



X n +i — 



y with probability min ( j" » 1 



otherwise. 



Since J n -2f{y) is taken as outside hi, note that X n+ x £ hi. Standard theory shows that for n large, 
P(X n j A Jn - 2 J( X 1 A M (dx). Careful evaluation of how large n must be to make this approximation 
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valid is an open research problem both here and in most real applications of the Metropolis algorithm 
(see Diaconis and Saloff-Coste (1998) and Diaconis, Lebeau and Michel (2010)). A host of heuristics are 
available for monitoring convergence; for adapting the choice of the proposal for e and for efficient use of 
the output. We will not discuss these further here. 

Several further examples admitting an explicit parameterization, with computations of J/, are in 
Hubbard and Hubbard (2007, Chap. 5) which is enthusiastically recommended to newcomers. 

2.3. Conditional Densities and the Co- Area Formula 

Federer's co-area formula gives an explicit density for the conditional distribution. The main tool is: 

Theorem: Co-Area Formula [F, Sect. 3.2.12] Suppose that $ : R AI -> is Lipschitz with 
M > N. Then 

(2.9) / g(x)J N <S>{x)X M {dx) = f [ g(x)H M - N (dx)\ N (dy). 

JR M JR n J'!>- 1 (y) 

In (2.9), g is Lebesgue measurable from R M — > K and Jn& is defined in Section 2.1. 

Recall next the definition of a regular conditional probability. Let (fi, J 7 , P) be a probability space 
and C C T a sub-sigma algebra. A function P(w,dw) from (fi x F) into [0, 1] is a regular conditional 
probability for P given C if 

(2.10a) For each w £ Q,P(w, •) is a probability measure on T . 

(2.10b) For each F £ J 7 , the function w i-> P(w, F) is C measurable. 

(2.10c) For C eC,F £ F,P{Cr\F) = [ P(w,F)P(dw). 

Jc 

Let p(x) be a probability density on R M with respect to X M (dx). Let $ : R M — > R N be Lipschitz 
with M > N. From Rademacher's Theorem, $ is differentiable at almost every x, and Jn&{x) can be 
computed by the usual rules. 

Proposition 2. Suppose that Jn$(x) exists and is strictly positive for all x where p(x) > 0. Then 

(a) The marginal density o/$ is absolutely continuous with density 

m(y) = I P<yX } M M ~ N idx) with respect to X M (dy). 

(b) If m(y) £ {0,oo} ; set Q(y,F) = 5 X *(F) for some fixed x* £ R AI . Else set 

Q(y,F) = n ^ r [ ^-M M - N {dx). 

m (y) J*~i(y)nF Jn9{x) 

Set P(x,F) = Q(Q(x), F). Then P is a regular conditional probability for P(dx) — p(x)X M (dx) 
given C = $ _1 (S) with B the Lebesgue measurable sets in K. . 

Proof. Clearly (2.10a) and (2.10b) are satisfied. To show (2.10c), fix C £ C and F a Lebesgue measurable 
set in R M . Take g in (2.1) to be 

fcnf(^) with g(x) defined as if p(x) = 0. 

Where ScnF denotes the indicator function of the intersection CflF. 

imsart-coll ver. 2011/11/15 file: DiaconisHolmesShah.tex date: July 6, 2012 



10 



Diaconis, Holmes, Shahshahani 



The co-area formula shows 



P(CnF) = I p(x)X M (dx) = I I 5c(x)S F (x)-?^H M - N (dx)\ N (dy) 

JCnF JR« J*-l(ly) Jn^(X) 



U M ~ N {dx)\ N {dy). 



Let C = {y : m{y) = ()},£«, = {y : m(y) = oo},C+ = (C U C^) . Taking C = F = R M , we see 
\ N (Coo) = 0. For y € C , U- Hy)nF j^U M - N (dx) = 0. Hence, the integrals equal 

P( x ) ilM-N fj„\\N i 



U M -»{dx)\»{dy) 

cnc+ J^- 1 (y)nF Jn^(x) 
<■-,('+ m{y) i$-i( s )nF Jjv$(z) 
c 



/ P{x,F)P{dx) 
Jc 



□ 



Remark Of course, m(y) can be 0, if "3> -1 (y) is empty or p vanishes there. Similarly, m(y) can be 
infinite: consider (following Tjur [1972, Sect. 30]) a set of finite area in E 2 of the shape shown in Figure 
3. Let p(x) be the normalized indicator of this set. Let §(x,y) = x, so Jn^(x) = 1. Then m(0) = oo. 

Example 1A (continued): From (1.1) the torus is {(x,y,z) € R 3 } 

x = (R + r cos(0)) cos(-0), y = (R + r cos(0)) sin(-0), z — rsm(9) 

< 6, tp < 2ir for R > r > 0. What is the conditional distribution in (8, ip) space given that x = 0? In 
the notation of Proposition 2, 



i(l+£cos(0)) 0<6>,^<2tt 
elsewhere 



The function $ : R 2 -> i? is 

$(6», V) = (i? + r cos(6>)) cos(^) 

Thus 

( J$) 2 = (r sin(0) cos(V>)) 2 + ((R + r cos(6>)) sin(V>)) 2 

<f- 1 (0) = {(6^), 0<9<2n, 

It follows that J&(0, |) = J$(0, = i? + rcos(#). This is proportional to p(9,ip) and Proposition 
2b says that the conditional distribution is uniform on the two line segments that make up $ _1 (0) and 
assigns equal mass to each segment. 



Example IB (continued) Consider the area measure on A4 + of (2.5). Proposition 1 above shows 
that A4 + is parametrized by a map / from the set U C R n ~ 2 and gives an explicit expression for 
the corresponding probability density. One standard method for sampling from this density is to use 
the Gibbs sampler. This entails sampling from the conditional distribution given the values at some 
of the coordinates. One simple implementation which uses Proposition 2 is this: A4 + is given as an 
embedded manifold in R™. From (xx,X2, ■ ■ ■ ,x n ) G A4 + , choose three coordinates uniformly at random, 
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Fig 3. Instance of Infinite region. 



fix the remaining (n — 3) coordinates. The map / of Proposition 1 composed with the projection onto the 
corresponding (n — 3) space gives a map $ : U — > K ra_3 . The conditional density given <!> = y is explicitly 
given by Proposition 2. Here $ _1 (0) is a one dimensional curve and the sampling problem reduces to a 
standard task. We omit further details. 



Example 3C: How Not To Sample Here is a mistake to avoid. Let M. be a compact embedded 
manifold. To sample from the area measure, the following scheme presents itself. Suppose that for each 
point i€A<a neighborhood M x Q M. is specified (e.g., a ball of specified radius on M). Suppose it is 
possible to sample from the area measure restricted to N x ■ It seems plausible that this drives a Markov 
chain with area measure globally. This is an error. Perhaps the easiest way to see through the problem is 
to consider the discrete case: 

Consider a finite connected undirected graph with vertex set X and edge set £ . Let 7t(ie) > 
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0, J2xsx n ( x ) = 1 be a probability distribution X. Suppose for each point x £ X , a neighborhood Af x is 
defined. These may be arbitrary finite sets; we do not need x £ J\f x , but will assume y £ Af x X € My. 
For example, we may take AT X — B r (x), the r-ball using graph distance. A Markov chain on X is defined 
as follows: 

From x, choose y £ M x with probability w restricted to M x . Thus 

(2.11) K ^ v)= Un,) dVEJV. 

10 otherwise 

Lemma 1. The chain (2.11) is reversible with reversing measure 
. . . . k(M x ) , k{x) 

(2.12) a \ x ) — 7 w*" 1 z a normalizing constant. 



Proof. If K(x,y) = 0, then K(y,x) — 0, so reversibility holds. Otherwise 

a{x)K{x,y) = — -r- = = o{y)K{y,x). 

z n(M x ) z 



□ 



Remarks 

1. Thus, unless Tt(J\f x ) = constant, o~(x) ^ ir(x). 

2. In the continuous setting, sampling locally from area measure H, this chain has stationary density 
proportional to %{M X ). An analysis of rates of convergence for this walk on compact Riemannian manifolds 
in Lcbeau and Michel (2010). 

3. On a curve, with distance measured by arc length, T-l(B r {x)) is constant for r suitably small 
because of the volume of tubes theorem. However, this is no longer true for higher-dimensional manifolds 
with non-constant Gaussian curvature. 

4. We may use the Metropolis algorithm to change the stationary distribution from a in (2.12) to 
7r. The chain is A4(x,y) = ir(y) min( ^ , ^ ) for x ^ y £ J\f x . Note that this requires knowledge of 
n{M x ), Tt{N y ). 



3. Exponential Families, Conditional Densities and the Co-Area Formula 



One motivation for the current work is conditional testing in statistical problems. This is a central topic 
of classical statistics beginning with R. A. Fisher's exact test for independence in contingency tables and 
the Neyman-Pearson theory of uniformly most powerful unbiased tests for exponential families. The best 
general reference for these topics is (Lehmann and Romano, 2005, Chap. 4, 5, 10) See also the survey 
in Diaconis and Sturmfels (1998) and the techniques and references in Lindqvist and Taraldsen (2005, 
2006). 

The problems addressed in the present paper are a continuous analog. Section 3.1 below presents 
exponential families in a version convenient for applications. Section 3.2 briefly discusses conditional 
densities and sufficiency. Section 3.3 uses the co-area formula to give a useful expression for the conditional 
density, given a sufficient statistic, with respect to the area measure. These formulae are applied in Section 
4. 
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3.1. Exponential Families 

Many widely-used families of probability measures, such as the Gamma family of Example IB, have a 
common exponential form. Theorems and properties can be derived generally and then applied in specific 
cases. A good first reference for this material is (Lehmann and Romano, 2005, Sect. 2.7). The specialist 
monographs of Barndorff-Nielsen (1978), Brown (1986) and Letac (1992) may be supplemented by the 
references in Diaconis, Khare and Saloff-Coste (2010) to give an overview of this basic subject. 

Let T : M. a — s- K° be a measurable function. Let O C M. b be a non-empty open set and ip '■ 6 — >• K fc a 
measurable function. Let fix) : W 1 — > R + be measurable and suppose 

< z{9) = { f(x)e^ w * T{x) \ a (dx) < oo for each 9 G 6. 



Definition The family of probability densities 

(3.1) P e {x) = z~ 1 {9)f{x)e^ e)mT ^ deO 

is called the exponential family generated by (f,Q,ip,T). 

For the Gamma family in Example IB, a = 1, b = 2, T(x) = 



(x, logx) x > 
otherwise 



e = R 2 + = {(^a):cr,a>0}, #r, a) = (- 1 , a - 1), z{8) = a a T(a) f(x) = i J ^ > ° . 

(7 10 otherwise 

The exponential families here are a subclass, in having absolutely continuous densities whose support 
does not depend on 9. 



3.2. Sufficiency 

The product measure on (M a ) n generated by Pg of (3.1) has density 

n 

z(9)- n Y[f(x l )e^ e >^ T ^\ 
j=i 

The function T = 22=i ^(^i) i s called a sufficient statistic for the family. The references above show 
that the distribution of the product measure conditional on T does not depend on 9. Conversely, the 
Koopman-Pitman-Darmois theorem says if Pg is a family of measures on R a with T locally Lipschitz 
and for some n > 2, the distribution of Pg, conditional on T, does not depend on 9, then Pg is an 
exponential family. See Hipp (1974) for a careful statement; see Diaconis (1988) for background and 
further references on sufficiency 

For the Gamma family, T is equivalent to S — x hP = 11"= l Xi as use d thoughout. 



3.3. Conditional Densities and the Co- Area Formula 



This dual to the area formula is explained in Section 2.3 above. We may use it directly to compute an 
expression for the conditional density of an exponential family given a sufficient statistic. 
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Theorem 1. With notation as above, forna > b consider an exponential family (3.1) based on a Lipschitz 



'//(:' 



T : R a -)• R b . Let T : R na -> R b (= Y,i=i T ( x i)) and suppose J b T{x) ^ for l\f( x i) ^ °- ^ 
A^t = {x € (M a ) n : r(a3) = £} Then, the conditional density on Mt with respect to the area measure is 

n 

(3.2) W^HfixJ/Jbfix). 

i=l 

with the normalizing constant W = W t taken to be J H"=i f( x i)/JbT(x)'H M ~ N (dx) if this integral is in 
(0,oo). 

Proof. In the co-area formula take W = T : (M a )" M b . Thus M = na, N = 6. For /i : M M ->• l w 
bounded continuous, set 



g{x) = l Jn^(x) 

[ otherwise 



Then 1 (t) = .M t and the co-area formula shows that M. t has positive, finite total area measure for 
a.e.t. Further 



M-N f . \ \N 



{dx)\"(dt). 



This formula says that (3.2) is a regular conditional probability for the product measure n"=i^ > f( a; ) 
given T — t. □ 

i?e?7iarfcs 

1. Since the conditional density (3.2) does not depend on 6, f is a sufficient statistic. 

2. The calculation shows that the marg inal density of T is e*^ -t / z(6») n VF with respect to A h (di). 
Thus the induced measures of T form an exponential family. 

Example: Gamma Family With T : E" — > M 2 given by f (x) = (J^" =1 £j, SlLi logXj), f° r ™ > 2, 

1 1 \2 



DT(x) 



1 1 ... 1 



W = £ ( 



From Theorem 1, we may sample from the conditional distribution of the Gamma family given T = t on 
A4t by sampling from the probability density (w.r.t. A" -2 ) proportional to 

J n -2f(x3, ■ ■ ■ , x n ) 
hT{J{xz, ■ ■ .,i„)) 

on U and / defined in (2.6), followed by randomizing the first two coordinates by a fair coin toss. 



4. Neyman's Smooth test and the Gibb's sampler 

This section illustrates a useful general procedure (the Gibbs sampler) in a natural example: Neyman's 
smooth test for goodness of fit. The problem reduces to sampling from an explicit density f(x±, X2, ■ ■ ■ , x n ) 
on the following submanifold: fix to and p\>pi>-- -p m - Let 

a 

M p = {xi,x 2 , ...x n ,0 <Xi< 1, x) =pi, 1 < i < to}. 

imsart-coll ver. 2011/11/15 file: DiaconisHolmesShah.tex date: July 6, 2012 



Sampling From A Manifold 



15 



In Neyman's case, m — 4, assume this for now. The idea underlying our algorithm, developed below, is 
to pick a uniformly chosen subset of m + 1 = 5 coordinates with probability 1/(5), say the first five. Set 
pi = x l y The submanifold 

5 

(4.1) M p = {xi,x 2 , ■ ■ ■ x 5 , < Xi <l,^x)=pi,l<i< 4} 

i=i 

is a curve which lies both on the submanifold A4 p and in M 5 . We may sample from the conditional density 
on the curve and replacing (x\, X2, ■ • ■ , £5) by the sampled values gives a new point on M. v . 

Repeatedly choosing fresh five-tuples gives a connected reversible Markov chain on M p with / as 
its stationary density . In the present section we find it convenient to work directly with the density of 
/ with respect to the area measure, avoiding the extra step of local coordinates. Neyman's smooth test 
is developed in 4.1, the relevant conditional densities are derived in 4.2 and 4.3 contains a study of the 
ergodicity of this chain. Section 4.4 develops an idea of Besag and Clifford (1989) for valid testing with 
non ergodic chains. 

4- 1. Neyman's Smooth test 

Consider investigating the following null hypothesis; fix F a distribution function of a continuous random 
variable. Let 

(4.2) H :X 1 ,X 2 ,X 3) ...X n ~iid F 
Then 

Yi = FiX,) 

are iid uniform on [0, 1]. Neyman (1937) developed a test of H based on testing 9 = in the model 

(4.3) f e (y) = z - 1 e ? 1 v +93V ' +0aV * +o *v\ 0<y<l 

This test (and its modifications by David (1939); Barton (1953, 1956)) has been shown to have a good 
empirical track record and comes in for repeated favorable mention in Lehman and Romano's survey of 
testing goodness of fit (Lehmann and Romano, 2005, chapter 9). That chapter also explains the difficulty 
of such omnibus testing problems. One justication for this test is that if the data are from a smooth 
distribution F, using a simple \ 2 test loses information because it breaks the data into categorical bins, 
losing the actual ordering of the bins. 

Any smooth positive probability density h(y) on [0, 1] can be expanded as 

h(y) = e lo9h{v) = e^=o 9 ^ 

The four parameter exponential family is a commonsense truncation of this non-parametric model. 
Fan (1996) has developed tests based on m term approximations with m chosen adaptively from the data. 

In the rest of this section we investigate the adequacy of the truncation (4.3) (with m = 4) by testing 
if the model (4.3) fits the data. Thus given data Yi, F2, ■ • • Y n in [0, 1], we develop conditional tests of the 
model (4.3). These ideas work for every m and could be used as input to Fan's adaptive procedure. The 
four dimensional sufficient statistics for the family (4.3) is 

n 

P = (Pl,P2,P3,Pi), Pi = ^2 Y J 
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The conditional procedures explained in section 4.2 are based on the conditional distribution of the model 
fg given p. This is supported on 

n 

(4.4) M p = {(xi,x 2 ,...,x n ),0< Xi <l,^x)=p h l<i< 4} 

This is a compact n — 4 dimensional submanifold of [0, 1]™. To actually construct a test, a test statistic 
must be chosen. Neyman's test of section 4.1 was based on the L 2 norm of the averages of the first four 
orthogonal polynomials for the uniform distribution on [0, 1]. Under (4.3) the sum of these norms should 
have been an approximate chi-square (4) distribution. We may follow Neyman, using a further orthogonal 
polynomial as the test statistic but calibrating it with the exact conditional distribution. 

4.2. The Gibbs Sampler 

The Gibbs sampler is well developed in Liu (2001). As usually explained, to sample from a probability 
density g(z\, z%, . . . , z n ) on K n one begins at a sample point zq = (zj, z®, • • • , z„) and changes coordinates 
sequentially: first to {z\, z%, ■ ■ ■ , z°) then to (z{, z\, . . . , z„) ...then Z\ — (z{, 25, ... , The ith change 
is made by sampling from the conditional distribution of the ith coordinate given all the rest. The one 
dimensional problem is supposed to be easy to do. The transition from z° to z 1 is one step of the Gibbs 
sampler. Proceeding as above to z 2 , z 3 ,. . . gives a Markov chain with g as stationary distribution. In the 
present problem 

(a) It is not possible to change just one coordinate and stay on the surface (4.1). The minimal change 
is in five coordinates resulting in the curve 

5 

(4.5) {(x 1 ,x 2 ,...,x 5 ) : < n < 1 ^= Pi }. 

i 

(b) Instead of random sampling, one can systematically run through all sets of five coordinates using 
for instance a Gray code approach as in Diaconis and Holmes (1994). 

(c) Sampling from a conditional distribution on the curve in (a) is not so simple and instead a single 
Metropolis step is proposed. This is sometimes called 'Metropolis on Gibbs' in the literature, for 
notational clarity we suppose that the five chosen coordinates are the first five. Let P be the 
conditional distribution for the model (4.3) on the submanifold (4.1). Let Q be the conditional 
measure on the curve (4.5). The following proposition determines the density of Q with respect to 
arc-length. 

Proposition 3. The measure Q on the curve (4-5) has density with respect to arc-length 
q(x\, X2, X3, Xi, X5) = z^ 1 \l J4T 1 z _1 a normalizing constant 



5 

^x),l<i<h 
i=i 

Proof. By the usual calculus of double conditioning, Q is the conditional distribution of the product 
measure fg on [0, l] 5 given pi,p2,P3,p4- Now use Theorem 1 of section 3.3. The mapping T : [0, l] 5 — » R 4 
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takes T(yx,y 2 ,y 3 ,y 4 ,y 5 ) = (pi,P2,P3,P4)- Clearly the 5 x 4 derivative DT is 



A 
i 

DT = 1 
1 

V 

so that J4 is given by (4.6) as claimed. 
Remark: 

For general m, the density is proportional to J m 



2tfi 

2j/4 

2y 5 



3yl %3\ 



3?/ 2 2 

2 



32/1 
32/1 



42/1 
42/sV 



with J m having i, j entry i ■ jp 



'i+j-2 



1 < i, j < m The 



following algorithm combines the ideas above to give a reversible Markov chain for sampling from the 
conditional distribution of the model 4.3 on the manifold From x g Air, 



(a) Choose five coordinates uniformly at random. Without loss, suppose these are the first five, calculate 

Pi = E*=i x i> 1 <*< / l- 

(b) Pick a small parameter e, then choose one of the five coordinates uniformly at random without loss, 
suppose the first coordinate has been chosen. Change x\ to y\ = x\ + £i with ei chosen uniformly 
in [xi -e,xt + e]. Solve for 2/2,2/3,2/4,2/5 so that y = (2/1,2/2,2/3,2/4,2/5) € TWp as in (4.1). 

(c) Calculate J4(x), J/t(y) from Proposition 3 above. If J^x) > J^y) the algorithm moves to y. If 
Ji(~x) < J4(y) flip a coin with success probability 



f J 4 (x) 

^ 4 (y) 



If success move to y, otherwise stay at x 
Remarks: 



For m < 4, calculations for solving the y can be done in closed form as they involve at most quartic 
equations. For higher m a variety of numerical procedures are available. 
Of course, if y in step (b) is outside [0, l] 5 , the algorithm stays at x . 

We began studying the problem hoping to parametrize the curve (4.1) and sample directly from the 
arc length measure. This proved impractical. The technique we have developed seems easier and is 
applicable to general continuous exponential families. 



4-3. Ergodicity 

Let Pj(x) = x\ + . . . + x 3 n and S be the set defined by 

(4.7) < xi < ... < x n < 1, Pi (a) = ci, . . . ,Pa{x) = c 4 . 

The closure of S will be denoted by S. We also assume that 1 > c\ > c 2 > C3 > C4 > which is a 
necessary condition for the existence of a solution to (4.7). Assume that the system (4.7) has a solution. 

Lemma 2. Let y G S be a solution to (4-7). Then there is a submanifold of dimension n — 4 passing 
through y € S . Furthermore the orthogonal projection of S near y on any coordinate line Xj contains a 
neighborhood of yj . 

Proof. We have dPj(x) = jx{~ 1 dx\ + . . . + jx 1 rl ^ 1 dx n . Therefore to show the first assertion it suffices to 
show that the matrix 



fl 


1 . 






x 2 ■ 


x n 


x\ 




x 2 


\x\ 


x 2 


• <) 
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has rank 4 which is immediate. The second assertion follows from the fact that the locally the system 
can be solved near y as function of any n — 4 coordinates. □ 

Lemma 3. Let k > 6 and y <= S be a solution to (4-7) ■ Consider the solution of the system (4-7) subject 
to the additional requirements 

x 3 = Vh for 3 ^ k - 

Then there is a submanifold of dimension k — 5 of solutions passing through y. For k — 6 the solution is 
a curve and its projection on the coordinate line x i7 1 < i < 5 contains a neighborhood of yi. 



Proof. We look at the differentials dPj , j 
to show that the (n — k + 5) x n matrix 



1, 2, 3, 4 and dxj, j > k. To prove the first assertion it suffices 



/l 


1 . 




1 \ 


Xi 


x 2 . 




• • • %n 










x\ 


x 2 




x 2 


x\ 


x 2 




X 3 





. 


. f 


... 





. 


. 1 


... 




. 


. 


... l) 



has rank n — k + 5 which is obvious. The second assertion follows from the fact we can solve for n — 1 
coordinates in terms of any one of x^s for i = 1, 2, 3, 4. □ 

Let M C S be a connected component of S. We consider the following process in M. Given that the 
process is at y — (yi, . . . , y n ) € M, one chooses five coordinates i\, . . . , 15 and the process can move to 
any point along the curve defined by 

x i = Vh for 3 J^k,-- -,k- 

The question we want to answer is whether any two points in M communicate in the sense that one can 
move from one to the other in a finite number of iterations. More technically, we say two points, y, z are 
sufficiently close if, given y G M there is S > such that if z is within S of y, then one can move from y 
to z in finite number of iterations. The positive number 5 may depend on y. 

Lemma 4. // two points y,z € M are sufficiently close then they communicate. 

Proof. We do induction on n. The case n = 5 is clear. Let i±, . . . , i§ = 1, . . . , 5 and k 
of Lemma 3. Then the determinant of the matrix in the proof of Lemma 3 is 



6 in the notation 



± n & 

i<j<6 



Xj). 



Therefore if y and z sufficiently close then one can move from (y±, . . . , y n ) to a point 

(zi, y' 2 , ■ ■ • , 2/5, ?/6j ■ ■ ■ j Vn) by the second assertion of Lemma 3. Now the induction hypothesis applies to 

complete the proof. □ 

Proposition 4. Any pair of points in M communicate. 

Proof. Starting at y G M we show that the set of points in M that can be reached in a finite number of 
steps from y is both open and closed in M . The required result then follows from connectedness of M . 
From Lemma 4 it follows that the set of points that can be reached from y in finitely many iterations is 
open. To show closed-ness let y — y^jy^ 2 ', ... be a sequence of points each of which can be reached in 
finitely many steps from y and assume y( m > — > z 6 M. Then for all m sufficiently large the point t/ m ) 
lies is a sufficiently small neighborhood of z and Lemma 4 is applicable to show that z can be reached in 
finitely many steps from such j/ m ) proving the required 'closed-ness'. □ 
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Let S' be the set defined by 



(4.8) <&!,..., x n < 1, Pi(x) = ci,...,P 4 (a;) = C4, 

and M' be a connected component of S' . We consider the process in M' where in addition we allow any 
permutation of the coordinates as well as evolution described in M. 

Proposition 5. Any pair of points in M' communicate. 

Proof. For points away from the set V consisting of the boundary hyperplanes of the unit cube in R™ 
and the generalized diagonal Ui^j{ x i = x j} the assertion follows from Proposition 4. Applying the Curve 
Selection Lemma (see for example Milnor (1968) ) we move away from V in one step, and then Proposition 
4 is applicable. □ 

4-4- Valid tests and connectedness 

For many applications of the present techniques, it is only a conjecture that the algorithms are ergodic. 
Consider the manifold Ai p above based on the first four sample moments. Choosing 5 coordinates and 
sampling from the correct conditional distribution on the resulting curve gives a way of moving around on 
Ai p . However it has not been proved that this algorithm is connected; Indeed Proposition 5 of section 4.3 
only shows that the algorithm goes between points in the same connected component (in the topological 
sense) in finitely many steps. 

Bormeshenko (2009) gave a difficult proof that the analogous problem based on changing 3 coor- 
dinates on the manifold determined by the sum and the sum of squares is connected and we certainly 
conjecture this for any number of moments. 

If these samples are used for goodness of fit test, there is a valid test procedure available, even in 
the absence of connectedness, by adapting an idea of Bcsag and Clifford (1989). 

The idea is simple. Let X be the original data. This gives rise to a point Xq on Ai p . Suppose K(x, dy) 
is a Markov chain with the correct stationary distribution on the connected component containing Xq. Fix 
a number of steps T* and run this chain T* steps to get y* say. Then run the time reversed chain, starting 
at y* for T* steps and independently repeat this B* times (starting at y* each time). This results in 
(x*,x*, . . . ,x* B ») £ Aip. The B* + 1 values (xq,xI,x\, . . . ,x* B ,) are exchangeable, so the relative position 
of any test statistic s(xq) among s(x*) is uniform under the null hypothesis. If s(xq) is among the extreme 
values of these statistics then a valid rejection is possible. 
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